3 research outputs found

    Learning to Separate Voices by Spatial Regions

    We consider the problem of audio voice separation for binaural applications, such as earphones and hearing aids. While today's neural networks perform remarkably well (separating 4+ sources with 2 microphones), they assume a known or fixed maximum number of sources, K. Moreover, today's models are trained in a supervised manner, using training data synthesized from generic sources, environments, and human head shapes. This paper intends to relax both of these constraints at the expense of a slight alteration in the problem definition. We observe that, when a received mixture contains too many sources, it is still helpful to separate them by region, i.e., to isolate the signal mixture arriving from each conical sector around the user's head. This requires learning the fine-grained spatial properties of each region, including the signal distortions imposed by a person's head. We propose a two-stage self-supervised framework in which overheard voices from earphones are pre-processed to extract relatively clean personalized signals, which are then used to train a region-wise separation model. Results show promising performance, underscoring the importance of personalization over a generic supervised approach. We believe this result could help real-world applications in selective hearing, noise cancellation, and audio augmented reality. (Audio samples are available at our project website: https://uiuc-earable-computing.github.io/binaural/.)
    Comment: Accepted to ICML 2022. For associated audio samples, see https://uiuc-earable-computing.github.io/binaural/
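
    To make the region-wise output definition concrete, the following is a minimal Python sketch, not the paper's model: it simply groups per-source signals by the conical sector their (assumed known) azimuths fall into, so sources sharing a sector remain mixed. The function name, signal shapes, and angles are hypothetical.

    import numpy as np

    # Hypothetical illustration of "separation by region": sources are grouped into
    # equal-width conical sectors around the listener and summed per sector, so
    # sources that share a sector stay mixed, as described in the abstract.
    def separate_by_region(source_signals, source_azimuths_deg, num_regions=4):
        source_signals = np.asarray(source_signals, dtype=float)
        num_samples = source_signals.shape[1]
        sector_width = 360.0 / num_regions
        regions = np.zeros((num_regions, num_samples))
        for sig, az in zip(source_signals, source_azimuths_deg):
            idx = int((az % 360.0) // sector_width)  # which conical sector the source falls in
            regions[idx] += sig                      # same-sector sources remain mixed
        return regions

    # Example: three sources, two of which fall in the same frontal sector.
    rng = np.random.default_rng(0)
    sigs = rng.standard_normal((3, 16000))
    region_mixes = separate_by_region(sigs, source_azimuths_deg=[10.0, 80.0, 200.0])
    print(region_mixes.shape)  # (4, 16000): one mixture per region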

    SpatialCodec: Neural Spatial Speech Coding

    In this work, we address the challenge of encoding speech captured by a microphone array using deep learning techniques, with the aim of preserving and accurately reconstructing the crucial spatial cues embedded in multi-channel recordings. We propose a neural spatial audio coding framework that achieves a high compression ratio by leveraging a single-channel neural sub-band codec together with SpatialCodec. Our approach encompasses two phases: (i) a neural sub-band codec encodes the reference channel at low bit rates, and (ii) SpatialCodec captures relative spatial information for accurate multi-channel reconstruction at the decoder end. In addition, we propose novel evaluation metrics to assess spatial cue preservation: (i) spatial similarity, which computes cosine similarity on a spatially intuitive beamspace, and (ii) beamformed audio quality. Our system shows superior spatial performance compared with high-bitrate baselines and a black-box neural architecture. Demos are available at https://xzwy.github.io/SpatialCodecDemo. Code and models are available at https://github.com/XZWY/SpatialCodec.
    Comment: Paper in submission
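
    The spatial similarity metric lends itself to a short sketch. The Python below is one illustrative reading of "cosine similarity on a spatially intuitive beamspace," assuming a bank of fixed beamforming weights toward several look directions; the array shapes, weight design, and function names are assumptions, not the authors' implementation.

    import numpy as np

    def beamspace(stft, steering_weights):
        """Project a multi-channel STFT (channels, freq, frames) onto fixed beams.
        steering_weights: (beams, channels, freq) complex beamforming weights.
        Returns beamformed spectra of shape (beams, freq, frames)."""
        return np.einsum('bcf,cft->bft', np.conj(steering_weights), stft)

    def spatial_similarity(ref_stft, est_stft, steering_weights):
        """Cosine similarity between beamspace magnitudes of reference and estimate."""
        b_ref = np.abs(beamspace(ref_stft, steering_weights)).ravel()
        b_est = np.abs(beamspace(est_stft, steering_weights)).ravel()
        denom = np.linalg.norm(b_ref) * np.linalg.norm(b_est) + 1e-12
        return float(np.dot(b_ref, b_est) / denom)

    # Toy check with random data: a near-copy of the reference scores close to 1.0.
    rng = np.random.default_rng(0)
    shape = (4, 129, 50)  # (channels, freq bins, frames)
    ref = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
    est = ref + 0.1 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
    weights = rng.standard_normal((8, 4, 129)) + 1j * rng.standard_normal((8, 4, 129))
    print(spatial_similarity(ref, est, weights))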

    Unifying Robustness and Fidelity: A Comprehensive Study of Pretrained Generative Methods for Speech Enhancement in Adverse Conditions

    Enhancing speech signal quality in adverse acoustic environments is a persistent challenge in speech processing. Existing deep-learning-based enhancement methods often struggle to effectively remove background noise and reverberation in real-world scenarios, hampering the listening experience. To address these challenges, we propose a novel approach that uses pre-trained generative methods to resynthesize clean, anechoic speech from degraded inputs. This study leverages pre-trained vocoder or codec models to synthesize high-quality speech while improving robustness in challenging scenarios. Generative methods effectively handle information loss in speech signals, yielding regenerated speech with improved fidelity and reduced artifacts. By harnessing the capabilities of pre-trained models, we achieve faithful reproduction of the original speech in adverse conditions. Experimental evaluations on both simulated datasets and realistic samples demonstrate the effectiveness and robustness of the proposed methods. In particular, by leveraging the codec, we achieve superior subjective scores for both simulated and realistic recordings. The generated speech exhibits enhanced audio quality and reduced background noise and reverberation. Our findings highlight the potential of pre-trained generative techniques in speech processing, particularly in scenarios where traditional methods falter. Demos are available at https://whmrtm.github.io/SoundResynthesis.
    Comment: Paper in submission
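
    As a rough illustration of the resynthesis idea (analyze the degraded input, enhance in a feature space, then generate a waveform with a frozen pre-trained decoder), here is a minimal Python sketch. The three stages are toy placeholders standing in for the real neural vocoder/codec models; none of the names or operations come from the paper.

    import numpy as np

    def analysis(wave, frame=256):
        """Toy feature extractor: framewise magnitude spectra of the degraded input."""
        frames = wave[: len(wave) // frame * frame].reshape(-1, frame)
        return np.abs(np.fft.rfft(frames, axis=-1))

    def enhance(feats):
        """Placeholder for the learned enhancement stage (here: a crude spectral floor)."""
        noise_floor = feats.mean(axis=0, keepdims=True) * 0.1
        return np.maximum(feats - noise_floor, 0.0)

    def generative_decode(feats, frame=256):
        """Placeholder for a frozen pre-trained vocoder/codec decoder (here: zero-phase iFFT)."""
        frames = np.fft.irfft(feats, n=frame, axis=-1)
        return frames.reshape(-1)

    noisy = np.random.default_rng(0).standard_normal(16000)
    resynthesized = generative_decode(enhance(analysis(noisy)))
    print(resynthesized.shape)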